Workshop FF UK 5.10.2023 1️⃣
Selected Chapters in Data Analysis (Vybrané kapitoly z analýzy dat)
LMU Munich
📫 renata.topinkova[at]lmu.de
1️⃣ What are APIs?
2️⃣ API with R package 💻
3️⃣ API without dedicated R package 💻
4️⃣ Web scraping
5️⃣ Web scraping practice 💻
All materials available at: https://github.com/renatatopinkova/2023_scraping_FFUK
API = Application Programming Interface
Source: https://www.geeksforgeeks.org/what-is-an-api/
✅ Legal (mostly)
✅ Structured data
✅ More robust to changes on the web
✅ There may be an R package!
❌ May not be available
❌ May need authentication
❌ Rate limits
❌ May be paid only
❌ Quality of documentation varies
❌ Can be cancelled anytime
On April 4, 2018, the post-API age reached a milestone. On that day, Facebook closed access to its Pages API, which had allowed researchers to extract posts, comments, and associated metadata from public Facebook pages (Schroepfer, 2018). This decision followed the company’s April 2015 closure of its public search Application Programming Interface (API), which provided searchable access to all public posts within a rolling two week window (Facebook, n.d.). The closure of the Pages API eliminated all terms of service (TOS)-compliant access to Facebook content. Let me underscore the magnitude of this shift: There is currently no way to independently extract content from Facebook without violating its TOS. (Freelon 2018)
Meta: No API access for researchers to Facebook and Instagram
Twitter: paid only as of Feb 2023
TikTok
Pushshift API (access to (historical) Reddit data), closed as of May 2023.
📱Social media
YouTube
Spotify
📰 News
Guardian
NYT
📊 Data sources
OECD
WHO
🌍 Gov
Data.police.uk
Covid data
🏴‍☠️ Unofficial “APIs”
OMDb
Google Trends
➿ Other
Google Maps
AccuWeather
Isn’t there an R package for that?
📦 WHO, guardianapi, spotifyr, nytimes, wbstats, RedditExtractoR
Are you sure?
Google, GitHub
If you’re SURE sure… Generic package
📦 httr, httr2
📖 STEP 1 : Read the documentation
Endpoint = the URL where a particular kind of data is requested (an API often has more than one)
Parameters = How can I narrow down what I want to get? What can I get? What values does the API accept?
Authentication = Do I need an API token? How do I get it? Where do I put it?
Rate limits = How much can I download in a minute/day?
ToS = What are you allowed to do with the data? Can you publish it? In what form?
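The first three items above map directly onto the parts of a request. A minimal sketch with httr2, assuming a hypothetical API at https://api.example.org/v1 with a `search` endpoint, `q` and `page` parameters, and a key sent in an `api-key` header. None of these names come from a real service; the documentation of the API you use tells you the actual ones.

```r
library(httr2)

# Everything below is a hypothetical placeholder for illustration.
base_url <- "https://api.example.org/v1"
api_key  <- "YOUR_KEY_HERE"

req <- request(base_url) |>
  req_url_path_append("search") |>            # endpoint
  req_url_query(q = "election", page = 1) |>  # parameters
  req_headers(`api-key` = api_key)            # authentication (header vs. query varies by API)

# Inspect the URL that would be requested; nothing is sent yet
print(req$url)

# To actually send the request and parse a JSON response:
# resp <- req_perform(req)
# data <- resp_body_json(resp)
```

Building the request first and performing it in a separate step makes it easy to check the URL against the documentation before spending any of your rate limit.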
If you are using a package 📦, read the package vignette 📖
For the API: usually found in the Docs section of the platform's developer site
Google: [name of platform] API
Rate limits are not always known
Exceeding rate limits can lead to throttling or being blocked
This is usually handled with Sys.sleep() in R, but some packages build the pauses into their queries
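A minimal sketch of the Sys.sleep() pattern in base R. The function `fetch_page()` is a hypothetical stand-in for a real API call, and the half-second delay is an arbitrary assumption; the right pause depends on the limits of the API in question.

```r
# Hypothetical stand-in for a real API call; returns a dummy record.
fetch_page <- function(page) {
  list(page = page, fetched_at = Sys.time())
}

pages         <- 1:3
delay_seconds <- 0.5  # assumed value; derive it from the documented rate limit

results <- vector("list", length(pages))
for (i in seq_along(pages)) {
  results[[i]] <- fetch_page(pages[i])
  # Pause between requests, but not after the last one
  if (i < length(pages)) Sys.sleep(delay_seconds)
}

length(results)
```

For example, if an API allowed 60 requests per minute, a delay of at least one second between calls would keep the loop under that limit.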
Project name, e.g.: API retrieval class
After this, receive an email -> verify your address -> receive a key
Store the key in a .txt file in your project directory.
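Reading the key back from the file keeps it out of your script. A minimal sketch in base R, assuming the file is called `api_key.txt` (a hypothetical name) and holds the key on its first line; `read_api_key()` is an illustrative helper, not part of any package.

```r
# Illustrative helper: read an API key from the first line of a text file.
read_api_key <- function(path = "api_key.txt") {
  if (!file.exists(path)) stop("Key file not found: ", path)
  key <- trimws(readLines(path, n = 1, warn = FALSE))
  if (length(key) == 0 || !nzchar(key)) stop("Key file is empty: ", path)
  key
}

# Demonstration with a temporary file standing in for the real key file
demo_path <- tempfile(fileext = ".txt")
writeLines("abc123  ", demo_path)   # trailing spaces are trimmed on read
demo_key <- read_api_key(demo_path)
```

In the project itself the call would be `api_key <- read_api_key()`, with the result passed on to the package or request that needs it.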
Open the 01_API_w_package_exercise.qmd file.
Web scraping in R 2023 - Renata Topinkova